Packet switching

ABSTRACT

A method of allocating switch requests within a packet switch, the method comprising the steps of establishing switch request data at each input port; processing the switch request data for each input port to generate request data for each input port-output port pairing; comparing the number of requests from each input port and to each output port with the maximum request capacity of each input port and each output port; allocating all requests for those input-output pairs where the total number of requests is less than or equal to the maximum request capacity of each input port and each output port; reducing the number of requests for those input-output pairs where the total number of requests is greater than the maximum request capacity of each input port and each output port such that the number of requests is less than or equal to the maximum request capacity of each input port and each output port; and allocating the remaining requests. Packets may be switched from an input port to a specified output port in accordance with the allocations obtained with the above method.

[0001] This invention relates to packet switching (or cell switching), in particular methods for allocating requests for switching from one of the inputs of a packet switch to one of the outputs of the packet switch.

[0002] Input-buffered cell switches and packet routers are potentially the highest possible bandwidth switches for any given fabric and memory technologies, but such devices require scheduling algorithms to resolve input and output contentions. Two approaches to packet or cell scheduling exist (see, for example, A Hung et al, “ATM input-buffered switches with the guaranteed-rate property,” and A Hung et al, Proc. IEEE ISCC '98, Athens, July 1998, pp 331-335). The first approach applies at the connection-level, where bandwidth guarantees are required. A suitable algorithm must satisfy two conditions for this: firstly it must ensure no overbooking for all of the input ports and the output ports, and secondly the fabric arbitration problem must be solved by allocating all the requests for time slots in the frame.

[0003] Fabric arbitration has to date been proposed by means of the Slepian-Duguid approach and Paull's theorem for rearrangeably non-blocking, circuit-switched Clos networks (see Chapter 3, J Y Hui, Switching and traffic theory for integrated broadband networks, Kluwer Academic Press, 1990). This connection-level algorithm can be summarised as firstly ensuring no overbooking and secondly performing fabric arbitration by means of circuit-switching path-search algorithms. It has been assumed that this algorithmic approach could only be applied at the connection level, because of its large computational complexity. For this reason, proposals for scheduling of connectionless, best-efforts packets or cells employ various matching algorithms, many related to the “marriage” problem (see D Gale and L S Shapley, “College admissions and the stability of marriage,” Mathematical Monthly, 69, 9-15 (1962) and D Gusfield and R W Irving, The Stable Marriage Problem: Structure and Algorithms, MIT Press, 1989) in which the input-output connections for each time slot or phase of the switch are handled independently, i.e. a frame of time slots (and hence phases) is not employed. Although such algorithms for choosing a set of conflict-free connections between inputs and outputs for each time slot, which are based on maximum size and maximum weight bipartite graph matching algorithms, can achieve 100% throughput (N McKeown et al, “Achieving 100% throughput in an input-queued switch,” Proc. IEEE Infocom '96, March 1996, vol. 3, pp. 296-302) they are also impractically slow, requiring running times of complexity O(N³logN) for every time slot (R E Tarjan, “Data structures and network algorithms,” Society for Industrial and Applied Mathematics, Pennsylvania, November 1983).

[0004] Iterative, heuristic, parallel algorithms such as iSLIP are known, which reduce the computing complexity (i.e. time required to compute a solution) for best-efforts packets or cells (N McKeown et al, “The Tiny Tera: a packet switch core,” IEEE Micro January/February 1997, pp 26-33). The iSLIP algorithm is guaranteed to converge in at most N iterations, and simulations suggest on average in fewer than log₂N iterations. Since no guarantees are needed, this and similar algorithms currently represent the preferred scheduling technique for connectionless data at the cell level in input-buffered cell switches and packet routers with large numbers of ports (e.g. N≧10). The iSLIP algorithm is applied to the Tiny Tera packet switch core, which employs Virtual Output Queueing (VOQ), in which each input port has a separate FIFO (First In, First Out) queue for each output, i.e. N² FIFOs for an N×N switch. If we assume that each FIFO queue stores at least a number of cells equal to the average cell latency L, and that each cell is a 53-byte ATM cell, then the total input FIFO queue hardware count is O(424LN²). With each element capable of clocking out 424f bits per frame, this is a complexity product of O((424)²fLN²), which is a very large complexity. Fortunately, by employing a single queue in the form of RAM in each port, acting as N virtual queues, the hardware count can be reduced to O(424LN), and with parallel readout reducing the number of steps per frame to just f, the overall complexity product can be reduced to O(424fLN). Table 1 gives the hardware and “computing” steps for these queues to provide f cells within a frame.

[0005] For unicast packets the iSLIP algorithm converges in at most N iterations, where N is the number of input and output ports. On average the algorithm converges in fewer than log₂N iterations. The physical hardware implementation employs N round-robin grant arbiters for the output ports and N identical accept arbiters for the input ports. Each arbiter has N input ports and N output ports, making N² links altogether. The total amount of hardware depends on the precise construction of the round-robin arbiters. N McKeown et al, op cit, employ a priority encoder to identify the nearest request from the port closest to a pre-determined highest-priority port (see FIG. 1). The priority encoder reduces the number of links down to log₂N parallel links, in order to change the pointer if required. The log₂N parallel links are then expanded back up to N links again through a decoder. Details of the hardware complexity of the arbiters are given in N McKeown, Scheduling Algorithms for Input-Queued Cell Switches, PhD Thesis, University of California, Berkeley, 1995. The growth rate for the complete scheduler is O(N⁴), each arbiter being O(N³). For a 32×32 cell switch (which is the size of the Tiny Tera switch), 421,408 2-input gates are required. This may be quite acceptable for such a small switch, but the O(N⁴) growth rate is extremely large.

[0006] In order to minimise the overall hardware and computing complexity, the best structure for constructing the encoder is a binary tree, which requires O(2N) elements (for large N) and only log₂N steps per iteration, whilst the decoder needs only O(N) elements. Pipelining cannot be employed to reduce to one step per iteration, because the pointers cannot be updated until the single-bit requests have passed through both sets of arbiters to the decision register. The total hardware and computing complexities are given below in Table 1. The hardware complexity now grows as O(N²) rather than O(N⁴), due to the binary tree encoder and decoder.

TABLE 1 Hardware and computing complexities of the iSLIP algorithm for scheduling f packets per port in a frame of f time slots.

                         Hardware Count   Computing Steps per Frame   Hardware.Computing Complexity Product
Input RAM queues         424LN            f                           424fLN
Average Convergence      O(6N²)           O(4flog₂N(1 + log₂N))       O(24fN²log₂N(1 + log₂N))
Guaranteed Convergence   O(6N²)           O(4fN(1 + log₂N))           O(24fN³(1 + log₂N))

[0007] The overall hardware.computing complexity product O(24fN³(1+log₂N)) of the iSLIP algorithm for scheduling f packets per port would be no less than that of the maximum size and weight matching algorithms of N McKeown et al, “Achieving 100% throughput in an input-queued switch,” Proc. IEEE Infocom '96, March 1996, vol. 3, pp. 296-302, if convergence must be guaranteed. There is a reduction to O(24fN²log₂N(1+log₂N)) for the average number of computing steps. The major benefit of the iSLIP algorithm is its parallel nature, which allows the number of computing steps to be traded against hardware complexity, thus reducing computing times by a factor N² at the expense of increasing the hardware by the same factor. It is interesting to note that hardware quantities for the input RAM queues far exceed those needed for the scheduling electronics.
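As a rough numerical illustration of the scheduler rows of Table 1, the following Python sketch evaluates the expressions for a given N and f. It treats the constants inside the O() expressions as exact, which is an assumption made purely for illustration, and the function name `islip_complexity` is hypothetical.

```python
import math


def islip_complexity(n: int, f: int) -> dict:
    """Evaluate the Table 1 scheduler expressions, taking the O() constants literally."""
    lg = math.log2(n)
    hardware = 6 * n ** 2                      # O(6N^2) hardware count
    average_steps = 4 * f * lg * (1 + lg)      # O(4f log2N (1 + log2N))
    guaranteed_steps = 4 * f * n * (1 + lg)    # O(4fN (1 + log2N))
    return {
        "hardware": hardware,
        "average_steps": average_steps,
        "guaranteed_steps": guaranteed_steps,
        "average_product": hardware * average_steps,
        "guaranteed_product": hardware * guaranteed_steps,
    }


if __name__ == "__main__":
    # A 32x32 switch with a frame of 32 time slots.
    for name, value in islip_complexity(32, 32).items():
        print(name, value)
```

Under these assumptions the guaranteed-convergence product exceeds the average-convergence product by a factor N/log₂N, which is the ratio of the two step counts.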

[0008] According to a first aspect of the invention there is provided a method of allocating switch requests within a packet switch, the method comprising the steps of

[0009] (a) establishing switch request data at each input port;

[0010] (b) processing the switch request data for each input port to generate request data for each input port-output port pairing;

[0011] (c) comparing the number of requests from each input port and to each output port with the maximum request capacity of each input port and each output port;

[0012] (d) allocating all requests for those input-output pairs where the total number of requests is less than or equal to the maximum request capacity of each input port and each output port;

[0013] (e) reducing the number of requests for those input-output pairs where the total number of requests is greater than the maximum request capacity of each input port and each output port such that the number of requests is less than or equal to the maximum request capacity of each input port and each output port; and

[0014] (f) allocating the remaining requests.

[0015] According to a second aspect of the invention there is provided a method of allocating switch requests within a packet switch, the method comprising the steps of;

[0016] (a) establishing switch request data at each input port;

[0017] (b) processing the switch request data for each input port to generate request data for each input port-output port pairing;

[0018] (c) allocating a first switch request from each of the input port-output port pairing request data, the requests being allocated only if the maximum request capacity of the respective output port has not been reached; and

[0019] (d) allocating further switch requests by the iterative application of step (c) until the maximum request capacity of each output port has been reached.

[0020] The present invention additionally provides a method of allocating switch requests within a packet switch, the method comprising the steps of;

[0021] (a) establishing switch request data at each input port;

[0022] (b) processing the switch request data for each input port to generate request data for each input port-output port pairing;

[0023] (c) identifying a first switch request from each of the input port-output port pairing request data;

[0024] (d) identifying further switch requests by the iterative application of step (c) until all of the switch request data has been identified;

[0025] (e) subject to the maximum request capacity of each input port and each output port, allocating all of the identified switch requests; and

[0026] (f) reserving unallocated switch requests for use in the next phase of switch request allocation.

[0027] The invention will now be described with reference to the following figures in which;

[0028] FIG. 1 is a schematic depiction of a known arrangement for allocating switch requests;

[0029] FIG. 2 is a schematic depiction of an apparatus for counting switch requests according to the present invention;

[0030] FIG. 3 is a schematic depiction of a second apparatus for counting switch requests according to the present invention;

[0031] FIG. 4 is a schematic depiction of an apparatus for counting the switch requests for each output port of a packet switch according to the present invention;

[0032] FIG. 5 is a schematic depiction of an apparatus for counting the switch requests for each output port of a packet switch according to an alternative embodiment of the present invention;

[0033] FIG. 6 is a graph comparing the performance of the iSLIP algorithm with that of the present invention;

[0034] FIG. 7 is a graph showing the performance ratio of the iSLIP algorithm to that of the present invention;

[0035] FIG. 8 is a second graph comparing the performance of the iSLIP algorithm with that of the present invention; and

[0036] FIG. 9 is a second graph showing the performance ratio of the iSLIP algorithm to that of the present invention.

[0037] As the scheduling of best-effort, connectionless cells within an input-buffered switch, router or network is being considered, each of the input ports could be assumed to have FIFO queues, each of which is destined for a different output port (i.e. virtual output queueing, VOQ). Although the flows are best-effort, we wish to schedule them on a frame-by-frame basis. However, there is no pre-reservation of time slots within this frame, thus a number f of cells are queued at each input port, in f time slots, and are being scheduled according to their output port destinations in such a way as to avoid contention. A particular cell or packet should be able to be transmitted across the switch fabric during any one of f time slots. Before performing fabric arbitration to ensure that there is no output port contention in each time slot, we must first make sure that there is no overbooking of the input and output ports within the frame.

[0038] If the total number of cells Nf (where N is the number of input ports and f is the number of time slots in a frame) to be switched across a cell or packet switch are to be computed together on a frame-by-frame basis, by means of a path-searching algorithm for a 3-stage circuit switch, then every cell could be represented as a port on the circuit switch. The number of computing steps needed to ensure no overbooking then depends on the amount of hardware that is acceptable. If O(fNlog₂(fN)) components are used, then O(fN) computing steps are needed. The number of computing steps can be reduced to O(log₂²(fN)) if more hardware is acceptable, i.e. O(fNlog₂²(fN)), using a Batcher sorting network, but this hardware quantity may be too large to be acceptable. However, to represent every cell as a port on a circuit switch is an over-restrictive constraint in a cell switch. In fact, in a cell switch, it is only necessary to ensure that the number of cells destined for each of the N output ports does not exceed the number of cells or time-slots in the frame as there is no requirement to exit the output port in any specific time-slot.

[0039] The present invention concerns a method for ensuring no overbooking of the input and/or output ports. An N×N request matrix R is defined, whose elements r_(i,j) represent the number of cells in input port i destined for output port j. The two conditions that ensure no overbooking are simply: $\sum_{j=1}^{N} r_{i,j} \le f \text{ for all } i \quad \text{and} \quad \sum_{i=1}^{N} r_{i,j} \le f \text{ for all } j$
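The overbooking test can be sketched in a few lines of Python. This is a minimal illustration, reading the conditions as "no row sum and no column sum of R exceeds f"; the function name `is_overbooked` and the list-of-lists representation of R are assumptions made for the example.

```python
from typing import List


def is_overbooked(r: List[List[int]], f: int) -> bool:
    """Return True if any input port (row) or output port (column) of R exceeds f."""
    n = len(r)
    row_sums = [sum(r[i][j] for j in range(n)) for i in range(n)]  # requests from input i
    col_sums = [sum(r[i][j] for i in range(n)) for j in range(n)]  # requests to output j
    return any(total > f for total in row_sums + col_sums)


if __name__ == "__main__":
    print(is_overbooked([[1, 1], [1, 1]], 2))  # every row and column sums to 2
    print(is_overbooked([[1, 1], [0, 2]], 2))  # second column sums to 3
```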

[0040] In practice, cells from more than f time slots in each input port could be considered in this procedure, if cells destined for overbooked ports have to be discarded. Discarded cells could either be lost completely, or continue to be queued for later attempts.

[0041] In order to establish the values of the request matrix elements, N² counters are established, one for each queue. FIG. 2 shows a schematic depiction of a possible input port arrangement for counting the request matrix elements, r_(i,j). Each of the N input ports 10 to a switch fabric 20 has N FIFO queues 11, N counters 12 and N switches 13 in order to direct cell requests to the appropriate counter 12. Assuming just f cells in each port are counted within the request matrix, each counter then requires log₂(f+1) counting stages, requiring O(N²log₂f) counter elements altogether. If it is assumed that the individual cell destination requests are input to these counters as single bits, there will be a maximum of O(f) computing steps required of any counter, giving an overall hardware.computing complexity product for the counters of O(fN²log₂f). FIG. 2 shows that we also require O(N) switches in each input port to direct the cell requests to the correct counter, i.e. O(N²) in total. The speed of these switches must be sufficient that, within one frame of f slots, flog₂N bits can be routed. The overall complexity product for the switches is therefore fN²log₂N.
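In software terms, the counting stage amounts to building R from the destination addresses queued at each input port. The sketch below is a behavioural model only, not the counter hardware of FIG. 2; the name `count_requests` and the representation of each input queue as a list of 0-based output port indices are assumptions.

```python
from typing import List


def count_requests(queues: List[List[int]], n: int) -> List[List[int]]:
    """Build the N x N request matrix from per-input lists of output port indices."""
    r = [[0] * n for _ in range(n)]
    for i, destinations in enumerate(queues):
        for j in destinations:
            r[i][j] += 1  # one counter per (input port, output port) pair
    return r


if __name__ == "__main__":
    # Two input ports, each holding f = 3 cell requests.
    print(count_requests([[0, 1, 1], [1, 1, 0]], 2))
```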

[0042] The method of queueing cell requests can be refined further. In FIG. 2, each port has N FIFO queues, each capable of buffering up to f cells. This requires O(fN²log₂N) buffer elements, each capable of toggling flog₂N times per frame, i.e. a complexity product of O(f²N²(log₂N)²). Fortunately, by employing a single queue in the form of RAM in each port, acting as N virtual queues, the hardware count can be reduced to O(fNlog₂N), with the same number of steps per frame, requiring a complexity product of O(f²N(log₂N)²) overall. Since a particular cell stored in RAM will be allocated to any of the f time slots within the frame, re-ordering may now also be required in an output queue if it is desired to preserve the cell order between input and output ports. This is now in essence the same as traditional time-slot interchanging in time-shared circuit switches. However, it would be possible to preserve the cell order of a virtual input queue by allocating time slots in time order to the cells destined for the same output port. Even with efficient buffering in a single queue within each input port, cell buffering is the most complex function in a switch, requiring the largest quantity of the fastest electronics. The cells here are just the request cells, containing only output port destination addresses (and possibly input port addresses as well as other parameter values). Much greater buffering complexity is needed for the actual cells or packets carrying all the header information and payload.

[0043] FIG. 3 shows an improved arrangement for counting the request matrix elements. A serial input stream of cell requests is converted to a parallel word, which is then transmitted over a parallel bus 31. Each line of the parallel bus is connected via gates 32 to sections of the RAM 33, each RAM section 34 holding an individual cell request of log₂N bits. Each RAM section can both read and write cell requests from and to the parallel bus. As the cell requests are written into RAM sections 34, they are also decoded into single-bit requests by the decoder 35 and transmitted to the array of counters 36. Each input port requires an array 36 having N counters, so the requirement for an N×N switch is N² counters. The overall complexity product, which was previously dominated by the RAM queues, is now reduced by a factor log₂N, which for large N could be an order of magnitude. This reduction is in terms of RAM access speeds, rather than quantity of buffer memory.

[0044] Once all of the matrix elements have been counted, the next step of the method of the present invention is to add the request matrix elements r_(i,j) to form the sum of the requests for each output port of the switch. If more than f requests are taken into account, then the sum of the requests from each input port must also be calculated. FIG. 4 shows an array of N² counters 41, each containing the number of requests for switching from a given input port to a given output port, e.g. counter [1,2] holds requests for switching from the first input port to the second output port and counter [1,N] holds requests for switching from the first input port to the Nth output port. The outputs from the counters 41 feed into an array of N adders 42, with each adder corresponding to one of the output ports of the switch fabric. Thus, the outputs of the counters which hold requests for switching to the first output port all connect to the input of the adder 42 which is associated with the first output port of the switch fabric. The counts may be represented as log₂f-wide words, each of which must be switched successively to the adder circuitry 43 by an associated switch array 44. For conventional adder constructions the software and hardware complexities are no greater than for counting the individual request matrix elements.

[0045] The third step of the method of the present invention is to compare the summations for each output port with f, which is the maximum number of cells that can be sent from each input port and to each output port in each frame. If any row or column of the request matrix exceeds f, the number of requests must be reduced to no more than f. One method of achieving this is to reduce the number of allowed requests to a number proportional to the actual number of requests, i.e. $r'_{i,j} = \frac{f}{\sum_{i} r_{i,j}} \cdot r_{i,j}$
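A minimal Python sketch of this proportional reduction, applied to overbooked output ports (columns) only, is given below. The text does not say how fractional results are handled, so rounding down to an integer is an assumption of this sketch, as is the function name `reduce_columns_proportionally`; overbooked input ports (rows) would need the same treatment with the roles of i and j exchanged.

```python
from typing import List


def reduce_columns_proportionally(r: List[List[int]], f: int) -> List[List[int]]:
    """Scale each overbooked column of R by f / (column sum), rounding down."""
    n = len(r)
    reduced = [row[:] for row in r]
    for j in range(n):
        column_sum = sum(r[i][j] for i in range(n))
        if column_sum > f:
            for i in range(n):
                # Integer floor of f * r[i][j] / column_sum (assumed rounding rule).
                reduced[i][j] = (f * r[i][j]) // column_sum
    return reduced


if __name__ == "__main__":
    # Output port 0 receives 8 requests but the frame holds only f = 4.
    print(reduce_columns_proportionally([[6, 0], [2, 1]], 4))
```

Because each entry is rounded down, the reduced column sum can fall slightly below f; any spare capacity would have to be handed out by some further rule.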

[0046] This step is efficient when there is a heavy concentration of requests on one, or a few, input or output ports. However, if the traffic is distributed uniformly from inputs to outputs, such that each element has a small number of requests, then the method is less efficient. In an alternative embodiment of the present invention, the requests are allocated in the following manner. Allocation of requests must be performed fairly, so that every request has a chance of being granted within the frame, while satisfying the “no overbooking” condition, to keep the average delay seen by individual requests low. Furthermore, every virtual output queue must have a chance of being granted a request within the frame, to prevent starvation. To fulfil these requirements, each r_(i,j) from all input ports must be granted at least one request, if it asks for one or more requests, as one port cannot be granted a large number of requests while other ports are granted none.

[0047] The tasks of summing requests and then reducing the number of requests as necessary are now replaced by a single mechanism which iteratively counts up the requests by granting one at a time to all r_(i,j) counters, until the sum of the requests destined for any output port equals the number f of slots in the frame. At that point no more requests can be granted to that output. On the first iteration, all request matrix elements, r_(i,j), with one or more requests for a given output port j are granted one of these requests, so that up to N requests may be granted in the first iteration (assuming here that f≧N). Each of the non-zero request matrix, r_(i,j), counters is now decremented by one (i.e. a 1 is subtracted from each log₂f-wide word), ready for the next iteration. Meanwhile, the successful requests granted to each output port, which are indicated as single bits in parallel from each r_(i,j) counter, are summed, for example by an adder with parallel input ports, converting the individual granted request bits into a word of length log₂(2N) bits.

[0048] FIG. 5 shows a circuit which can be used to implement the alternative embodiment of the invention. The circuit includes an array of N² request matrix counters 41 (as shown in FIG. 4 and described above) and N adding elements 51. The adding elements 51 each comprise a parallel bit adder 52 which receives in parallel the single bit requests from each associated request matrix counter 41 and creates a log₂(2N)-wide word. The output of each parallel bit adder 52 is connected to a respective log₂(2f)-wide adder 53.

[0049] There are N such parallel adders 52, each of which requires 4N binary adders. The number of steps required in each iteration to obtain the sum of the requests is log₂N.log₂(2N)/2. Those r_(i,j) counters that have a second request for a given output port have a second single bit summed through the parallel adders in a second iteration of the process described above. Altogether there could be as many as f iterations, if only one input port has requests destined for any one of the output ports. At the other extreme, for a uniform traffic distribution where a given output receives cells from different inputs, there could be just one iteration required. Because there is no buffering within the parallel adders, pipelining can be employed to sum each of the iterations. The maximum number of steps required for a maximum of f iterations is (f+log₂N.log₂(2N)/2). The complexities are summarised in Table 4.
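The step counts quoted in this paragraph can be evaluated directly. The short Python sketch below simply plugs numbers into the two expressions; the function names are hypothetical and N is assumed to be a power of two so that the logarithms are whole numbers.

```python
import math


def parallel_adder_steps(n: int) -> float:
    """Steps per iteration through one parallel bit adder: log2(N) * log2(2N) / 2."""
    return math.log2(n) * math.log2(2 * n) / 2


def max_summation_steps(n: int, f: int) -> float:
    """Pipelined worst case for f iterations: f + log2(N) * log2(2N) / 2."""
    return f + parallel_adder_steps(n)


if __name__ == "__main__":
    for n, f in [(32, 32), (256, 256)]:
        print(n, f, parallel_adder_steps(n), max_summation_steps(n, f))
```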

[0050] At the output of each parallel adder 52 there is a temporal succession of up to f log₂(2N)-wide words. The successive words must also be added sequentially by the adder 53 to obtain the overall total number of requests granted to an output port. Since this number cannot exceed f (i.e. the number of time slots in the frame) this can be done using a log₂(2f)-wide adder. The adder construction in FIG. 5 employs O(log₂²(2f)) half adders. Alternatively, a conventional adder construction employing log₂(2f) full adders could be used. The succession of f log₂(2N)-wide words can be stored in a buffer of size f.log₂(2N), if necessary. These can then be clocked out sequentially at a suitable rate via switches for the following log₂(2f)-wide adder. The log₂(2f)-wide adder 53 calculates the overall total number of requests granted to all input ports destined for a particular output port (and this total must not exceed f). On the iteration that drives the overall total above f, the counting must be stopped for that output port. Each request matrix, r_(i,j), counter whose cell requests are destined for that output port must be advised and keep a record of the number of iterations it took part in. It will only be allowed a number of requests equal to the number of iterations for which the overall count of requests for a given output port is ≦f. If the overall total on the previous iteration is less than f, then up to N additional requests may be allocated up to the total of f (when f≧N). This can simply be done by examining the status of each request matrix, r_(i,j), counter in turn in a maximum of N steps. When f<N, then up to f additional requests may be allocated. This may also take a maximum of N steps, by examining the status of each request matrix counter in turn, depending on the locations of the requests distributed around the N counters. Each of the request matrix counters now knows how many requests have been allocated to it.
To maintain fairness between the input ports destined for a particular output port, a pointer may be used, so that additional requests can be allocated preferentially to different input ports in different frames. (Of course, a specific pattern of requests could be such that the same input ports are in fact allocated additional requests in different frames). There are many ways in which a pointer could be used to allocate additional requests, including existing round-robin techniques. The simplest way would be to cycle the pointer continuously around the input ports, allocating one request at a time to any requesting port, stopping when the required number of additional requests up to N has been granted. The next frame that subsequently needs to allocate additional requests then begins from that pointer position.
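The iterative granting procedure for a single output port, including the pointer-based allocation of the last few requests, can be modelled in Python as follows. This is a behavioural sketch under stated assumptions, not the adder circuit of FIG. 5: the function name `grant_for_output`, the 0-based port numbering and the rule that the pointer advances to the position after the last additional grant are all illustrative choices.

```python
from typing import List, Tuple


def grant_for_output(requests: List[int], f: int, pointer: int = 0) -> Tuple[List[int], int]:
    """Grant requests for one output port, one per requesting input per iteration.

    requests[i] is the count r[i][j] for this output port j. Returns the number
    of requests granted to each input port and the updated pointer position.
    """
    n = len(requests)
    remaining = list(requests)
    granted = [0] * n
    total = 0
    while True:
        askers = [i for i in range(n) if remaining[i] > 0]
        if not askers:
            break  # every request has been granted
        if total + len(askers) <= f:
            # Full iteration: each non-zero counter is granted one request
            # and decremented by one.
            for i in askers:
                granted[i] += 1
                remaining[i] -= 1
            total += len(askers)
            continue
        # This iteration would drive the total above f: hand out only the
        # spare capacity, cycling round the input ports from the pointer.
        spare = f - total
        position = pointer
        while spare > 0:
            if remaining[position] > 0:
                granted[position] += 1
                remaining[position] -= 1
                spare -= 1
            position = (position + 1) % n
        pointer = position
        break
    return granted, pointer


if __name__ == "__main__":
    print(grant_for_output([3, 1, 0, 2], f=4, pointer=0))
    print(grant_for_output([3, 1, 0, 2], f=4, pointer=3))
```

In the two calls shown, the first iteration grants one request to each of the three requesting inputs; the single remaining slot then goes to whichever requesting input the pointer reaches first, so a different pointer position changes which input receives it.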

[0051] It would be possible to reduce the N steps required to allocate the additional requests to O(log₂N) steps, by using an (N,N) concentrator to pack requests next to each other, so that only the required number up to N of those concentrated requests can be gated or switched through to the parallel bit adder. The concentrator construction should preferably be such that the relative positions of the requests are preserved at the output of the concentrator, like that in T Szymanski, “Design principles for practical self-routing non-blocking switching networks with O(NlogN) bit-complexity,” IEEE Trans. On Computers, vol. 46, no. 10, 1057-1069 (1997). The gating or switching of the concentrator outputs could be achieved by means of a modified log₂N-stage decoder, which not only produces, for example, a logic “1” on the appropriate numbered decoder output port (controlling the concentrator output port), to represent the last of up to N ports to be allowed through to the adder, but also propagates through its log₂N stages a logic “1” to all decoder outputs (and controlled concentrator outputs) that lie above the last of up to N ports. In this way all decoder output ports above and including the decoded one provide enable bits (“1”s) to the concentrator output ports that they control, and all the decoder ports below the decoded one provide disable bits (“0”s) to the concentrator output ports that they control. Because the pointer could be in any of the N counter positions, and we wish preferably to allow through requests starting from the pointer position, the process of allocating additional requests could be split into two steps.

[0052] In the first, only the requests including and below the pointer position are sent to the concentrator for gating or switching through to the adder. Disabling of the requests above the pointer position can be performed by a similarly modified decoder. In the second part, only those requests lying above the pointer position are allowed through to the concentrator. The overall hardware complexity required for the concentrator and decoder using this construction is O(3N²log₂N), and the number of computing steps using Szymanski's concentrator construction is O(8log₂N).
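The net effect of the concentrator and modified decoder can be modelled functionally in Python. The sketch below does not model the log₂N-stage hardware; it only reproduces the selection: requests are packed together in order starting from the pointer, and the first k of them are enabled. Treating "below the pointer" as higher port indices, together with the name `select_additional`, is an assumption of this sketch.

```python
from typing import List


def select_additional(request_bits: List[int], k: int, pointer: int) -> List[int]:
    """Enable at most k requesting ports, taken in order starting at the pointer."""
    n = len(request_bits)
    # First pass: pointer position onwards; second pass: positions before the pointer.
    scan_order = list(range(pointer, n)) + list(range(0, pointer))
    # Order-preserving concentration: requesting ports packed next to each other.
    concentrated = [i for i in scan_order if request_bits[i]]
    # Modified decoder: enable the first k concentrator outputs, disable the rest.
    enabled = concentrated[:k]
    grants = [0] * n
    for i in enabled:
        grants[i] = 1
    return grants


if __name__ == "__main__":
    print(select_additional([1, 0, 1, 1], k=2, pointer=2))
    print(select_additional([1, 0, 1, 1], k=2, pointer=3))
```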

[0053] In a further alternative embodiment of the present invention, the counting and summing of requests could be performed as described above, except that the counting and summation may continue beyond the iteration that drives the overall total of requests above f for a particular output port, so that the complete total of requests can be counted. The log₂(2f)-wide adder would now need to be a log₂(2fN)-wide adder, with a hardware count of Nlog₂(2Nf).log₂(4Nf)/2; computing steps O(flog₂(2Nf)); and overall complexity product O(Nflog₂²(2Nf).log₂(4Nf)/2). Furthermore, it could also be possible for the counters to hold cell request counts from previous frames, and to add these to the counts of each new frame. This would allow running totals, or perhaps weighted averages, to be used covering longer time-slots than a single frame, if cell queue lengths employed are longer than a frame length. The number of cell acceptances between input and output ports could then be calculated on perhaps a fairer basis related to longer-term flows between input and output ports.

[0054] Indeed, whether the actual queue lengths employed are longer than the frame length or not, i.e. even if unsuccessful cells are discarded within each frame, it may be advantageous for the number of cell acceptances within each frame to be related to such a longer-term measure of traffic flow requests between ports, rather than just the cell requests within the frame itself. Of course the number of cell acceptances could also be related to a combination of longer-term flow and “within frame” requests.
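One possible way of keeping such a longer-term measure is an exponentially weighted average of the per-frame request counts. The text does not specify the weighting, so the formula, the parameter `alpha` and the function name `update_flow_estimate` in the Python sketch below are assumptions chosen for illustration.

```python
from typing import List


def update_flow_estimate(previous: List[List[float]],
                         frame_counts: List[List[int]],
                         alpha: float = 0.5) -> List[List[float]]:
    """Blend this frame's request counts into a running per-pair flow estimate.

    alpha is the weight given to the newest frame; (1 - alpha) is kept from history.
    """
    return [
        [alpha * new + (1 - alpha) * old for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(previous, frame_counts)
    ]


if __name__ == "__main__":
    estimate = [[4.0, 0.0], [0.0, 4.0]]
    estimate = update_flow_estimate(estimate, [[2, 2], [2, 2]])
    print(estimate)
```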

[0055] Once the number of requests r_(i,j) allocated between each input and output port within a frame is known, the individual cell requests buffered in the RAM queues must be identified. This can be achieved by re-running the individual cell requests out of the RAM queues, through the decoder switches and re-counting them in the request matrix counters. This time, each counter is set at its allocated total, and can be decremented by each cell request bit that it receives. While the counter is still above zero, a single bit (e.g. a “1”) is sent back to the RAM queue to the cell position corresponding to the current request bit, signifying cell acceptance. When a cell request bit decrements the counter to zero, and for all subsequent request bits, a single bit (e.g. a “0”) is sent back to the appropriate RAM queue cell position to signify non-acceptance of that particular cell request. The status (accepted or rejected) of all cell requests stored in RAM queues is now established. The hardware and computing complexities are the same as for the first count of matrix elements r_(i,j), except that an additional N(N+f) switches are needed. The additional number of steps required through these is f. All accepted cell requests are now ready to have their time slots calculated for transmission across the switch fabric (fabric arbitration).
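The second pass through the RAM queue of one input port can be sketched in Python as below. The sketch assumes that exactly the allocated number of cells per output port is accepted, in queue order, which is one reading of the counter behaviour described above; the function name `mark_accepted` and the list representation of the queue are hypothetical.

```python
from typing import List


def mark_accepted(destinations: List[int], allocated: List[int]) -> List[int]:
    """Return one accept (1) or reject (0) bit per queued cell request.

    destinations[k] is the output port of the k-th cell request in the queue;
    allocated[j] is the number of requests allocated to output port j.
    """
    counters = list(allocated)  # each counter is preset to its allocated total
    flags = []
    for j in destinations:
        if counters[j] > 0:
            counters[j] -= 1
            flags.append(1)  # cell accepted
        else:
            flags.append(0)  # allocation for this output port is used up
    return flags


if __name__ == "__main__":
    # Five queued requests; two allocated to output 0 and one to output 1.
    print(mark_accepted([0, 1, 0, 0, 1], [2, 1]))
```

Accepting cells in queue order also keeps the cells of each virtual output queue in their original sequence.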

[0056] Excluding the necessary cell or packet buffering in RAM queues, for which there is no essential difference, the hardware.computing complexity product of the method of the present invention is O(log₂N) smaller than iSLIP for all hardware items. Taking the worst values across all hardware items, again excluding cell buffering, this actually translates into fewer computing steps than iSLIP by the same factor O(log₂N), but at the expense of greater hardware quantities by the same factor O(log₂N) for some of the hardware items. Nevertheless, the hardware quantities required for the method of the present invention are much smaller than the cell buffering hardware.

[0057] FIG. 6 shows the number of computing steps for the iSLIP algorithm with average convergence [solid line] and the method of the present invention [dashed line], for a small switch with N=32 input and output ports, as a function of the number of time slots f under consideration. FIG. 7 shows the ratio of computing steps for the two algorithms, which reaches a minimum of 0.32 for f=32 time slots. Although this means that the method of the present invention is more than three times as fast as iSLIP, it does require cells to be buffered for 32 time slots to achieve this minimum ratio. However, the ratio is around 1/3 for all numbers of time slots from 8 upwards, so any desired cell latency could be chosen. The benefits of a frame-based algorithm become more significant for switches with more ports N. FIGS. 8 and 9 show the graphs equivalent to FIGS. 6 and 7 for the case where N=256 ports. Here, the method of the present invention takes only 0.195 times the iSLIP computing time at minimum, requiring an optimum f=192 time slots. Once again, the practical number of time slots can be anything from 64 upwards, yet still provide around a 5-fold speed advantage. Thus significant computing time reductions are achievable even if the number of time slots f employed is made equal to the number of ports N.

[0058] The overall hardware.computing complexity product of the method of the present invention is of order N times smaller than a maximum weight matching algorithm. In comparison, the iSLIP algorithm having average convergence is only O(N/log₂N) smaller.

1. A method of allocating switch requests within a packet switch, the method comprising the steps of: (a) establishing switch request data at each input port; (b) processing the switch request data for each input port to generate request data for each input port-output port pairing; (c) comparing the number of requests from each input port and to each output port with the maximum request capacity of each input port and each output port; (d) allocating all requests for those input-output pairs where the total number of requests is less than or equal to the maximum request capacity of each input port and each output port; (e) reducing the number of requests for those input-output pairs where the total number of requests is greater than the maximum request capacity of each input port and each output port such that the number of requests is less than or equal to the maximum request capacity of each input port and each output port; and (f) allocating the remaining requests.
 2. A method of allocating switch requests within a packet switch, the method comprising the steps of: (a) establishing switch request data at each input port; (b) processing the switch request data for each input port to generate request data for each input port-output port pairing; (c) allocating a first switch request from each of the input port-output port pairing request data, the requests being allocated only if the maximum request capacity of the respective output port has not been reached; and (d) allocating further switch requests by the iterative application of step (c) until the maximum request capacity of each output port has been reached.
 3. A method of allocating switch requests within a packet switch, the method comprising the steps of: (a) establishing switch request data at each input port; (b) processing the switch request data for each input port to generate request data for each input port-output port pairing; (c) identifying a first switch request from each of the input port-output port pairing request data; (d) identifying further switch requests by the iterative application of step (c) until all of the switch request data has been identified; (e) subject to the maximum request capacity of each input port and each output port, allocating all of the identified switch requests; and (f) reserving unallocated switch requests for use in the next phase of switch request allocation.
 4. A method of packet switching wherein the input port-output port routing is allocated according to the method of any of claims 1-3 and the packets are switched on the basis of the allocated routing.
 5. A packet switch in which the input port-output port routing for switch requests is allocated in accordance with the method of any of claims 1 to 3.
 6. A packet switch according to claim 5, wherein packets are switched from an input port to a specified output port in accordance with the allocated routing. 